Gavin Gray 12th June 2014

This notebook uses the full list of interactions between Entrez IDs extracted from the DIP dataset to build a training set with both positive and negative examples. It should produce two tables, one of positive interactions and one of negative interactions, in the following form:

Positive table

Protein 1 Protein 2 Interacting?
12345 54321 1
... ... 1

Negative table

Protein 1 Protein 2 Interacting?
12435 43521 0
... ... 0

In [1]:
cd /home/gavin/Documents/MRes/DIP/human/


/home/gavin/Documents/MRes/DIP/human

In [2]:
ls


DIPtouniprot.tab  flat.DIP.txt  flat.Entrez.txt  flat.uniprot.txt  Hsapi20140427.txt  interacting.DIP.txt  interacting.Entrez.txt  interacting.uniprot.txt  uniprottoEntrez.tab

This boils down to taking the flat list of Entrez IDs and using all pairwise combinations as candidates for the negative training set. Any combination that is known to interact, i.e. that appears in the positive training set, has to be removed from the negatives. We may also want to sample only a subset of the remaining combinations, to reflect our belief that the DIP does not contain every interaction for the proteins that it does know about.

To find the combinations of these proteins we can use Python's itertools module:


In [3]:
import itertools
import csv

In [4]:
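#load in the flattened list of protein IDs
#(chain.from_iterable joins the rows returned by csv.reader into one flat list)
ids = list(itertools.chain.from_iterable(csv.reader(open("flat.Entrez.txt"))))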

In [5]:
#find all combinations of Entrez protein IDs
negids = list(itertools.combinations(ids, 2))

In [6]:
#examples
print negids[0:10]


[('6367', '207'), ('6367', '60684'), ('6367', '3840'), ('6367', '6616'), ('6367', '908'), ('6367', '7066'), ('6367', '200081'), ('6367', '51010'), ('6367', '56915'), ('6367', '23303')]

In [7]:
#how many combinations are there?
print "Number of combinations of the full human DIP Entrez protein list is %i."%len(negids)


Number of combinations of the full human DIP Entrez protein list is 5921961.
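
As a sanity check on this count: a list of n entries has n(n-1)/2 unordered pairs, so the length of the ID list can be recovered from the total. A minimal sketch, written for Python 3 (unlike the Python 2 cells in this notebook); pairs_count and n_from_pairs are illustrative names, not part of the original analysis:

```python
import itertools
import math

def pairs_count(n):
    """Number of unordered pairs from a list of n entries: n*(n-1)/2."""
    return n * (n - 1) // 2

def n_from_pairs(total):
    """Invert n*(n-1)/2 = total for n (positive root of the quadratic)."""
    n = (1 + math.isqrt(1 + 8 * total)) // 2
    if pairs_count(n) != total:
        raise ValueError("%d is not of the form n*(n-1)/2" % total)
    return n

# Toy check against itertools on a small list of made-up IDs
toy_ids = ["1", "2", "3", "4"]
assert len(list(itertools.combinations(toy_ids, 2))) == pairs_count(len(toy_ids))

# The count printed above implies the length of the flat ID list
n_ids = n_from_pairs(5921961)
```

Here n_ids comes out as 3442, which is the number of entries that were in ids when the combinations were built.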

Removing overlap with positive examples

Since this is every combination, the positive examples (other than self-interactions) must also be in this list, so they need to be removed:


In [8]:
#first load in the positive examples:
posids = list(csv.reader(open("interacting.Entrez.txt"), delimiter="\t"))
print posids[0:10]
print "Number of positive examples: %i"%len(posids)
#remove entries that contain self-interactions:
posids = [(x,y) for x,y in posids if x != y]


[['5894', '596'], ['596', '7157'], ['2033', '7157'], ['2118', '2033'], ['1385', '1387'], ['4603', '1387'], ['2033', '7528'], ['6720', '1387'], ['101839559', '6929'], ['4654', '6929']]
Number of positive examples: 5498

The next part is somewhat computationally intensive because the number of combinations is so large. Luckily, there is a Stack Overflow post about exactly this: Python has a set type that supports this kind of operation.


In [9]:
#then remove all the positive entries from the negative list using the set type
posids = set(posids)
negids = set(negids)
negids = negids - posids

Unfortunately, this operation only removes tuples that match exactly between the two sets. Since the order of the protein pair also has to match, it fails to remove pairs whose order is reversed. Luckily, we can work around this by reversing every pair in posids and repeating the subtraction:


In [10]:
rposids = set([(y,x) for x,y in posids])

In [11]:
negids = negids - rposids
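
An alternative to the two-pass subtraction above is to put every pair into a canonical order before building the sets, so that (a, b) and (b, a) compare equal. A minimal sketch on made-up IDs, written for Python 3; the helper canonical and the toy data are illustrative and not part of the original notebook:

```python
import itertools

def canonical(pair):
    """Return the pair as a sorted tuple so that order no longer matters."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

# Toy data (made-up IDs): one positive pair is stored in "reversed" order
toy_ids = ["10", "20", "30"]
toy_pos = [("20", "10"), ("30", "30")]  # second entry is a self-interaction

# Canonicalise the positives and drop self-interactions
pos = {canonical(p) for p in toy_pos if p[0] != p[1]}
# Canonicalise every candidate pair, then remove known positives in one pass
neg = {canonical(p) for p in itertools.combinations(toy_ids, 2)} - pos
```

The IDs are compared as strings, so the canonical order is lexicographic rather than numeric; that is fine here because it only needs to be consistent. The pairs come out in sorted rather than original order, which should be harmless as long as whatever consumes the tables treats the two proteins symmetrically.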

Downsampling negative examples

How many negative interactions should there be for each positive interaction? The number used by Qi was 600. We use the same ratio here, defined as a variable so that it can easily be changed if required:


In [12]:
negtoposratio = 600
negN = negtoposratio*len(posids)
print "The number of positive examples is %i, therefore we require %i negative examples which can be sampled from the %i combinations available."%(len(posids), negN, len(negids))


The number of positive examples is 5103, therefore we require 3061800 negative examples which can be sampled from the 5904050 combinations available.

Sampling directly from a large set is possible, but it would require rewriting the set type; there is a Stack Overflow post on this topic. It is worth first trying simpler, less efficient methods to see if they run fast enough. The simplest is to convert the set to a list, shuffle it and slice off as many samples as we want:


In [13]:
from random import shuffle
negids = list(negids)
#shuffle the list in place
shuffle(negids)

In [14]:
#extract negN samples from this for our training set
negexamples = negids[0:negN]

In [15]:
print negexamples[0:10]


[('5931', '71175'), ('5901', '7862'), ('409', '4055'), ('114803', '7486'), ('26157', '17918'), ('397', '21343'), ('4686', '396766'), ('9231', '2253'), ('14794', '2243'), ('51386', '4205')]
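
The shuffle-and-slice above could also be written with random.sample, which draws without replacement and leaves the original collection untouched; fixing a seed makes the draw repeatable. A minimal sketch on toy data, written for Python 3; the seed value and the names draw_negatives and toy_neg are assumptions for illustration:

```python
import random

def draw_negatives(candidates, n, seed=None):
    """Draw n distinct items from candidates without replacement."""
    # Sort first: set iteration order is not stable between runs,
    # so a seed alone would not make the draw repeatable.
    candidates = sorted(candidates)
    if n > len(candidates):
        raise ValueError("asked for %d samples from %d candidates"
                         % (n, len(candidates)))
    rng = random.Random(seed)  # private generator, global state untouched
    return rng.sample(candidates, n)

# Toy candidate pairs (made-up IDs)
toy_neg = {("1", "2"), ("1", "3"), ("2", "3"), ("2", "4"), ("3", "4")}
sample_a = draw_negatives(toy_neg, 3, seed=0)
sample_b = draw_negatives(toy_neg, 3, seed=0)
```

With the same seed, sample_a and sample_b are identical. The explicit size check also guards against the ratio asking for more negatives than there are candidate pairs, a case in which slicing would silently return fewer.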

Saving both training tables

All that is left is to save the training tables. The naming chosen is:

  • training.positive.Entrez.txt is the name of the positive training set table
  • training.negative.Entrez.txt is the name of the negative training set table

These are both structured as described above.


In [16]:
with open("training.positive.Entrez.txt", "w") as f:
    csv.writer(f, delimiter="\t").writerows((x, y, 1) for x, y in posids)

In [17]:
with open("training.negative.Entrez.txt", "w") as f:
    csv.writer(f, delimiter="\t").writerows((x, y, 0) for x, y in negexamples)
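
A quick way to confirm that a saved file matches the layout described at the top is to read it back and check each row. A minimal sketch, written for Python 3; it writes toy rows to a temporary file rather than touching the real tables, and write_table and read_table are hypothetical helper names:

```python
import csv
import os
import tempfile

def write_table(path, pairs, label):
    """Write (protein 1, protein 2, label) rows as tab-separated text."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows((a, b, label) for a, b in pairs)

def read_table(path):
    """Read the table back as a list of (protein 1, protein 2, label) tuples."""
    with open(path, newline="") as f:
        return [(a, b, int(label)) for a, b, label in csv.reader(f, delimiter="\t")]

# Made-up IDs, written to a throwaway location
toy_pairs = [("12345", "54321"), ("111", "222")]
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.positive.txt")
write_table(toy_path, toy_pairs, 1)
rows = read_table(toy_path)
```

Each element of rows is a three-field tuple with the label as an integer, so a row with a missing or extra column would fail to unpack and show up immediately.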